Abstract

A new method is proposed in this paper to learn an overcomplete dictionary from training data samples. Differing
from the current methods that enforce a similar sparsity constraint on each of the input samples, the proposed method
imposes a global sparsity constraint on the entire data set. This enables the proposed method to fittingly
assign the atoms of the dictionary to represent various samples and optimally adapt to the complicated structures
underlying the entire data set. By virtue of the sparse coding and sparse PCA techniques, a simple algorithm is
designed for the implementation of the method. The efficiency and the convergence of the proposed algorithm
are also theoretically analyzed. Experimental results on a series of signal and image
data sets show that our method performs better than the current dictionary learning methods in
recovering the original dictionary, reconstructing the input data, and revealing salient data structure.

Index Terms

Dictionary learning, signal reconstruction, sparse principal component analysis, sparse representation, structure
learning.

I. Introduction 

In recent years, there has been significant interest in using sparse representation over a redundant
dictionary as a driving force for various data processing tasks [1]-[5]. All of these applications capitalize
on the fact that salient features underlying a data sample (e.g., a signal or an image) can always be captured
by its sparse representation over an appropriate dictionary. As such, the pre-specified dictionary is crucial
to the success of the sparse representation model in practical applications. Most conventional studies use
"off-the-shelf" dictionaries, such as the wavelet [6] and DCT bases [7], [8], to build a sparsifying
dictionary based on a mathematical model of the data. Current studies, however, have demonstrated the
advantages of learning an often overcomplete dictionary matched to the data samples of interest [4], [9],
[10], [11].

The dictionary learning task is generally described in the following mathematical way: For a collection
of data samples $X = [x_1, \ldots, x_n] \in \mathbb{R}^{d \times n}$, it is expected to find the dictionary
$D = [d_1, d_2, \ldots, d_m] \in \mathbb{R}^{d \times m}$ (the atom number $m$ is set larger than $d$, implying that the dictionary is redundant) through the
optimization model [12], [13]:

$$\min_{D,A} \sum_{i=1}^{n} \left[ \frac{1}{2}\|x_i - D\alpha_i\|_2^2 + \lambda P(\alpha_i) \right], \qquad (P_\lambda)$$

where the vector $\alpha_i$ contains the representation coefficients of $x_i$, and we denote the coefficient matrix
as $A = [\alpha_1, \alpha_2, \ldots, \alpha_n] \in \mathbb{R}^{m \times n}$. The objective function of $(P_\lambda)$ involves two elements considered in
the dictionary learning task: the expression error term, $\|x_i - D\alpha_i\|_2^2$, and the sparsity controlling term,
$P(\alpha_i)$, with respect to each involved sample $x_i$. The most widely utilized choices of $P(\alpha_i)$ include the
$\ell_0$ penalty $\|\alpha_i\|_0$, the $\ell_1$ penalty $\|\alpha_i\|_1$, the MC penalty, etc. The task can also be accomplished by solving
the following two alternative optimization problems:

$$\min_{D,A} \|X - DA\|_F^2 \quad \text{s.t.} \quad P(\alpha_i) \le k, \ \forall\, 1 \le i \le n, \qquad (P_k)$$

and

$$\min_{D,A} \sum_{i=1}^{n} P(\alpha_i) \quad \text{s.t.} \quad \|x_i - D\alpha_i\|_2^2 \le \varepsilon, \ \forall\, 1 \le i \le n, \qquad (P_\varepsilon)$$

where the notation $\|A\|_F$ stands for the Frobenius norm, and $\|A\|_0$ counts the nonzero entries of the matrix
or the vector $A$. The tuning parameters $\lambda$, $k$, and $\varepsilon$ in the models $(P_\lambda)$, $(P_k)$, and $(P_\varepsilon)$ play a very important
role in the model performance. They intrinsically control the compromise between the expression error
of the sparse representation and the sparsity of the representation coefficients of the final results.

It should be noted that a uniform parameter $\lambda$, $k$, or $\varepsilon$, formulated for the entire training data set $X$, is
specified in the current model $(P_\lambda)$, $(P_k)$, or $(P_\varepsilon)$, respectively. Such a formulation facilitates the parameter
selection and algorithm construction for the model. The samples collected from applications, however,
are always of varying interior structures. On one hand, some samples may be composed of complicated
features and need to be very densely represented under the dictionary, while some might be of very
simple structure and can be precisely represented with very sparse coefficient vectors. On the other hand,
some collected samples may seriously deviate from the original (e.g., signals mixed with analog-to-digital
conversion errors or images transmitted through noisy channels) while some may be totally clean
samples. This means that we should vary the parameter $\lambda$, $k$, or $\varepsilon$ with respect to different input samples
to make the dictionary learning model adaptive to the intrinsic structures underlying the entire set of training
samples. The uniform specification of the sparsity penalty $\lambda$, the maximal representation sparsity $k$, or
the minimal representation error $\varepsilon$ of the conventional model $(P_\lambda)$, $(P_k)$, or $(P_\varepsilon)$, respectively, is thus
not very reasonable, and tends to lead to unstable performance of the model in applications.

The purpose of this paper is to construct a new dictionary learning model that does not impose a similar
penalty or constraint on each input sample as the conventional methods do. Instead, the model specifies
a global sparsity constraint on the entire sample set, so that it adaptively tunes the representation
sparsity of diverse samples and properly reveals the intrinsic structures underlying the entire data set. An
algorithm is specifically designed for the proposed model. It is efficient, easy to implement,
and is expected to converge to a local optimum of the problem. By a series of experiments, it is verified
that the proposed algorithm, in comparison with the current dictionary learning methods, can not only
deliver a more faithful dictionary underlying the input samples, but can also more precisely recover the
original data under the dictionary. Besides, the intrinsic structure underneath the entire training sample
set can be depicted via the different representation sparsities of the samples under the
learned dictionary.

In what follows, related work in the literature is first reviewed in Section II. Details of our algorithm
and its basic model are then presented in Section III. The convergence and the computational complexity
of the proposed algorithm are also evaluated in that section. The experimental results are given in Section
IV for substantiation and verification. The paper is then concluded with a summary and outlook for future
research.

II. Related work 

Using sparse representations of data under an appropriately specified redundant dictionary has advanced
multiple data processing tasks, and has drawn a lot of attention in the past decade or so. In the conventional
studies, "off-the-shelf" dictionaries have always been employed in various applications. The typical
ones include the Fourier [14], the wavelet [15]-[19], and the DCT bases [7], [8]. These bases have been
applied to many practical problems, and sometimes yield nearly optimal sparse representations
of input samples.

Recent research, however, has demonstrated the significance of learning an overcomplete dictionary,
instead of a fixed one, matched to the input data of interest. Various algorithms along this line have been
formulated in recent years. For example, the algorithm proposed by Olshausen and Field [20] can find
sparse linear codes for natural scenes. The dictionary composed of these codes complies with the intrinsic
features of the localized, oriented, bandpass receptive fields of the neurons of the primary visual cortex.
The method of optimal directions (MOD), proposed by Engan et al. [21], is also an appealing dictionary
training algorithm. It improves the efficiency of the work by Olshausen and Field [20] both in the sparse
coding and dictionary updating stages. By generalizing the K-means clustering process to alternate
between sparse coding and dictionary updating, Aharon et al. [4], [9] designed the K-SVD method. There
are two versions of the method: one achieves sparse data representations under a strict sparsity constraint
(corresponding to the model $(P_k)$, and thus denoted as K-SVD$_{P_k}$) [4], and the other attains the dictionary
by allowing a bounded representation error for each training sample (corresponding to the model $(P_\varepsilon)$,
and thus denoted as K-SVD$_{P_\varepsilon}$) [9]. The method has empirically shown state-of-the-art performance
in some image processing applications. Some methods have further been constructed to improve the
efficiency of dictionary learning: Mairal et al. [22] proposed an online model to efficiently
solve the dictionary learning problem; Jenatton et al. [23] used a tree-structured sparse representation
to give a linear-time computation of the problem; and Lee et al. [12] designed an algorithm to speed up
the sparse coding stage of the problem, allowing it to learn larger sparse codes than other algorithms.
Recently, some algorithms have also been developed to extend the capability of dictionary learning based
on some specific motivations. For instance, Shi et al. [13] developed an algorithm for dictionary learning
with the non-convex yet continuous MC penalty; Mairal et al. [24] established a discriminative approach,
instead of the purely reconstructive methods, to build a dictionary. All of the aforementioned methods
address the models $(P_\lambda)$, $(P_k)$, and $(P_\varepsilon)$ introduced in Section I.

For many real data processing applications, however, these models cannot fully tally with the practical 
data samples possessing intrinsic complicated structures, such as those with different content of features, or 
with spatial and/or spectral non-uniform noises. Along this line of research, Mairal et al. [25] addressed the 
case of removing nonhomogeneous white Gaussian noise of images by virtue of the a priori knowledge of 
the noise deviation at each pixel of the objective images. Such elaborate information about noise, however, 
generally cannot be attained in practice. Very recently, Zhou et al. [10], [11] designed a nonparametric 
Bayesian method, called the beta process factor analysis (BPFA) method, for dictionary learning. The 
method can learn a sparse dictionary in situ for input samples with spatially non-uniform noises, without 
having to know the a priori noise information. The method is thus used as one of the methods for 
comparison in our experiments. 

In this paper, we propose a new dictionary learning method to improve the capability of the current 
dictionary learning methods by imposing a global sparsity constraint on all training samples to enable 
optimal atom assignment to individual training samples and adaptation to their intrinsic structures. In what 
follows, we give a detailed analysis of the proposed model and the associated algorithm.

III. Dictionary learning under global sparsity constraint: model and algorithm

A. Model: From local to global constraint on sparse representation

The current dictionary learning models, namely $(P_\lambda)$, $(P_k)$, and $(P_\varepsilon)$, enforce a similar sparsity control
parameter, i.e., the sparsity penalty $\lambda$, the sparsity constraint $k$, or the representation error bound $\varepsilon$, for each
involved sample. However, there are often counterexamples to such a formulation in real applications,
especially for data samples embedded with intrinsically heterogeneous sparsity structures. We take the image
case as an instance, in which the training data set corresponds to the small local patches of the image
under consideration. On one hand, the local parts of a real image may contain very different amounts of
meaningful features, e.g., a region full of patterns with abundant textures, as compared to an area
located in the background with small grayscale variations. In such a case, a smaller sparsity penalty $\lambda$ of
$(P_\lambda)$, or a larger representation sparsity $k$ of $(P_k)$, should be preset for the local patches located in the
former region so that more atoms of the dictionary can be assigned to them. This phenomenon can easily
be understood via the representations of patches A and B located at the house eave and the background
parts of the "house" image, respectively, as shown in Fig. 1. On the other hand, the real noise mixed in the
image is often significantly heteroscedastic. This means that the extents of noise corruption
in various parts of the image, such as patches C and D in Fig. 1, may be significantly different. It is
easy to see that patch C is highly corrupted by noise while D is almost clean and contains essentially no
noise. A larger representation error bound $\varepsilon$ of $(P_\varepsilon)$ should then be set for the former patch so that the
model can properly capture the variation of noise corruption across the image. It is thus foreseeable that
the performance of the current dictionary learning models can be substantially enhanced by adaptively
regulating the sparsity control parameter(s) with respect to the underlying structural characteristics of the
entire set of training samples (image patches).

[Fig. 1 panels: "The 'house' image" (left) and "The 'house' image with nonhomogeneous Gaussian noise" (right). Below the images, patch A is expanded over 11 dictionary atoms with coefficients -47.98, 40.477, 90.007, 219.23, -51.76, 66.646, -75.51, -41.61, -67.39, -56.13, and 326.79; patch B is expanded over a single atom with coefficient -100.44; patches C and D are each shown as a clean patch plus noise.]

Fig. 1. The left figure is the 'house' image, and the right figure is the same image mixed with nonhomogeneous Gaussian noise. A, B, C, D
are four patches cut from the two images, respectively. The upper two expressions at the bottom of the figure show the atoms and the
coefficients utilized to sparsely represent the patches A and B over the dictionary learned from the algorithm proposed in Section III.B,
respectively. The lower two demonstrate the ground truth of the noise separated from the patches C and D, respectively.

Based on the above rationale, we reformulate the model for dictionary learning into the following
global-sparsity-constraint form:

$$\min_{D,A} \|X - DA\|_F^2 \quad \text{s.t.} \quad \|A\|_0 \le K, \qquad (P_K)$$

where $K$ is the maximal number of non-zero entries of the coefficient matrix $A$. Differing from the
current models in which a similar sparsity constraint is imposed on each input sample, our proposed model
capitalizes on the global sparsity constraint imposed upon the entire sample set. This formulation
enables the model to adaptively assign a different number of atoms, $k_i$, of the dictionary to represent
each sample $x_i$ according to its intrinsic structure. This can easily be understood through the following
equivalent reformulation of $(P_K)$:

$$\min_{D,A} \|X - DA\|_F^2 \quad \text{s.t.} \quad \|\alpha_i\|_0 \le k_i, \ \forall\, 1 \le i \le n, \ \sum_{i=1}^{n} k_i = K. \qquad (1)$$

Specifically, for input data samples containing different amounts of features, more atoms will
be assigned to represent more complex samples by the proposed model; and for samples corrupted by
heterogeneous noises, the distribution of the non-zero entries of the coefficient matrix tends to be
optimally balanced among the entire sample set, and the sparse representations of the samples over the
dictionary attained by the proposed model will possibly reveal the major information (the original data
samples) while eliminating the minor information (the mixed noises) at the global scale. The dictionary learning model
$(P_K)$ is thus expected to outperform the current models.
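
The allocation effect described above can be illustrated with a small numerical sketch. It assumes an orthonormal dictionary, so that the best coefficients under a sparsity budget are obtained by hard thresholding the analysis coefficients $C = D^T X$; this simplification, the toy values, and the function names `per_sample_threshold` and `global_threshold` are illustrative assumptions and not part of the proposed algorithm.

```python
import numpy as np

def per_sample_threshold(C, k):
    """Keep the k largest-magnitude entries in every column of C (per-sample budget, as in P_k)."""
    A = np.zeros_like(C)
    idx = np.argsort(-np.abs(C), axis=0)[:k, :]      # top-k row indices for each column
    cols = np.arange(C.shape[1])
    A[idx, cols] = C[idx, cols]
    return A

def global_threshold(C, K):
    """Keep the K largest-magnitude entries of the whole matrix C (global budget, as in P_K)."""
    A = np.zeros_like(C)
    flat = np.argsort(-np.abs(C), axis=None)[:K]
    rows, cols = np.unravel_index(flat, C.shape)
    A[rows, cols] = C[rows, cols]
    return A

# Toy coefficients: sample 0 is "complex" (four active atoms), sample 1 is "simple" (one active atom).
C = np.array([[5.0, 9.0],
              [4.0, 0.1],
              [3.0, 0.1],
              [2.0, 0.1]])

A_local = per_sample_threshold(C, k=2)     # 2 atoms per sample, 4 non-zeros in total
A_global = global_threshold(C, K=4)        # 4 non-zeros in total, allocated freely

# With an orthonormal dictionary, ||X - DA||_F^2 equals ||C - A||_F^2.
err_local = np.sum((C - A_local) ** 2)
err_global = np.sum((C - A_global) ** 2)
counts_global = np.count_nonzero(A_global, axis=0)   # atoms used per sample
```

With the same total budget of four non-zero entries, the global rule spends three atoms on the complex sample and one on the simple sample, and the squared error is lower than under the uniform per-sample rule.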
We now construct an efficient algorithm for solving $(P_K)$.

B. Algorithm: Iterative updating of the column and row vectors of the coefficient matrix 

The main idea of our algorithm is to iteratively update the column and row vectors of the coefficient
matrix $A$ to attain a local optimum of $(P_K)$. Denote the column and row vectors of the coefficient
matrix $A$ as $[\alpha_1, \ldots, \alpha_n]$ and $[(\alpha_1^r)^T, (\alpha_2^r)^T, \ldots, (\alpha_m^r)^T]^T$, respectively. The column updating step is
to update each column vector $\alpha_i$ ($i = 1, 2, \ldots, n$) of $A$, with the number of its non-zero entries, $k_i$, fixed,
while the positions of these non-zero elements within the column are optimally relocated in an adaptive way. By
decomposing the model $(P_K)$, the corresponding task is to solve the following optimization problem for
each $\alpha_i$ ($i = 1, \ldots, n$):

$$\min_{\alpha_i} \|x_i - D\alpha_i\|_2^2 \quad \text{s.t.} \quad \|\alpha_i\|_0 \le k_i. \qquad (2)$$

The row updating step is to update each row vector $\alpha_i^r$ ($i = 1, 2, \ldots, m$) of $A$, with the sparsity $k_i^r$ of $\alpha_i^r$
fixed, while its $k_i^r$ non-zero elements are optimally adapted to the proper positions within the row. Since the atom
$d_i$ of the dictionary $D$ corresponds completely to $\alpha_i^r$ in the sense that $DA = \sum_{i=1}^{m} d_i (\alpha_i^r)^T$ [4], it is also
simultaneously updated in this step together with $\alpha_i^r$. The corresponding optimization is of the following
form for each $\alpha_i^r$ ($i = 1, \ldots, m$):

$$\min_{d_i, \alpha_i^r} \|E_i - d_i (\alpha_i^r)^T\|_F^2 \quad \text{s.t.} \quad \|\alpha_i^r\|_0 \le k_i^r, \ d_i^T d_i = 1, \qquad (3)$$

where $E_i = X - \sum_{j \ne i} d_j (\alpha_j^r)^T$ stands for the representation error of all considered samples with the effect
of the $i$-th atom $d_i$ removed.

The algorithm for solving the optimization model $(P_K)$ can then be summarized as Algorithm 1.

Algorithm 1 The algorithm for dictionary learning under global sparsity constraint (GDL)

Given: $X = [x_1, \ldots, x_n] \in \mathbb{R}^{d \times n}$, the global sparsity $K$

Execute:

1. Initialize the dictionary $D \in \mathbb{R}^{d \times m}$ and the coefficient matrix $A \in \mathbb{R}^{m \times n}$ with sparsity $K$, respectively.

2. Repeat

2.1 (Column updating). Update the column vector $\alpha_i$ of $A$ by solving (2) for each $i = 1, \ldots, n$.

2.2 (Row updating). Update the row vector $\alpha_i^r$ of $A$ and the atom $d_i$ of $D$ by solving (3) for each $i = 1, \ldots, m$.

Until the termination condition is satisfied

Return: the solution $D$, $A$ of $(P_K)$.

Now the question is how to efficiently solve the optimization problems (2) and (3). Problem (2) is
the well-known $\ell_0$-norm model of sparse coding, and multiple effective algorithms have been
developed for this model. The typical ones include the thresholding methods, e.g., the
hard thresholding algorithm [26], and the greedy methods, e.g., the OMP algorithm [27].
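
As an illustration of the column updating step, the following is a minimal sketch of a greedy OMP-style solver for problem (2). It is not the implementation used in the experiments; the function name `omp`, the tolerance `1e-12`, and the toy dimensions are assumptions made for this example, and the dictionary is assumed to have unit-norm atoms.

```python
import numpy as np

def omp(D, x, k):
    """Greedy approximation of min ||x - D a||_2^2 s.t. ||a||_0 <= k.

    D is assumed to have unit-norm columns (atoms).
    """
    m = D.shape[1]
    a = np.zeros(m)
    support = []
    coef = np.zeros(0)
    residual = np.asarray(x, dtype=float).copy()
    for _ in range(k):
        correlations = np.abs(D.T @ residual)
        if support:
            correlations[support] = 0.0        # never reselect an atom
        j = int(np.argmax(correlations))
        if correlations[j] < 1e-12:            # residual is already explained
            break
        support.append(j)
        # Least-squares refit of the coefficients on the current support.
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    if support:
        a[support] = coef
    return a

# Toy usage: a 3-sparse signal over a random 20 x 50 dictionary.
rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))
D /= np.linalg.norm(D, axis=0)
a_true = np.zeros(50)
a_true[[3, 17, 41]] = [1.5, -2.0, 0.8]
x = D @ a_true
a_hat = omp(D, x, k=3)
```

The returned vector has at most `k` non-zero entries, so applying the solver to every column with budget $k_i$ keeps the total number of non-zeros in $A$ within $K$.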

For (3), we give the following theorem:

Theorem 1: The optimum of the model (3) can be attained by solving the optimization problem:

$$\max_{w} \; w^T E_i^T E_i w \quad \text{s.t.} \quad w^T w = 1, \ \|w\|_0 \le k_i^r, \qquad (4)$$

in the sense of relation (5), where $\hat{d}_i$, $\hat{\alpha}_i^r$ are the optima of (3), and $\hat{w}$ is the optimum of (4), respectively.
The proof of Theorem 1 is given in the appendix.
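
The link between (3) and (4) can be made explicit with a short calculation. The following is a sketch derived directly from the two stated models, not the proof from the appendix; it writes $\alpha_i^r = c\,w$ with $\|w\|_2 = 1$ and $c \ge 0$, which is a parametrization assumed here for convenience (the sign of $c$ can be absorbed into $w$), and it assumes $E_i w \neq 0$.

```latex
\begin{aligned}
\|E_i - c\, d_i w^T\|_F^2
  &= \|E_i\|_F^2 - 2c\, d_i^T E_i w + c^2
  && \text{(using } d_i^T d_i = w^T w = 1\text{)} \\
\text{minimizing over } \|d_i\|_2 = 1:\quad
  & d_i = \frac{E_i w}{\|E_i w\|_2},
  && \text{giving } \|E_i\|_F^2 - 2c\,\|E_i w\|_2 + c^2 \\
\text{minimizing over } c \ge 0:\quad
  & c = \|E_i w\|_2,
  && \text{giving } \|E_i\|_F^2 - w^T E_i^T E_i w .
\end{aligned}
```

Under these assumptions, minimizing (3) amounts to maximizing $w^T E_i^T E_i w$ over unit vectors with at most $k_i^r$ non-zero entries, which is (4), and a maximizer $\hat{w}$ yields the factors $E_i \hat{w} / \|E_i \hat{w}\|_2$ for the atom and $\|E_i \hat{w}\|_2\, \hat{w}$ for the row vector.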

It is very interesting that (4) is just the sparse principal component analysis, or sparse PCA, model [28],
which has been thoroughly investigated in the last decade [28]-[32]. Many efficient algorithms have been
constructed for solving the model, including SPCA [28], GPower [29], and sPCA-rSVD [32].

Thus, both models (2) and (3) can be efficiently solved, i.e., both the column and row updating steps
of the proposed algorithm can be effectively implemented, by employing the existing methods in sparse
coding and sparse PCA, respectively.
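
To make the row updating step concrete, the sketch below approximates (4) with a truncated power iteration and then forms a unit-norm atom and a sparse row vector from the result. The truncated power iteration is a simple stand-in chosen for brevity, not the sPCA-rSVD or GPower algorithms cited above; the function names `truncate` and `sparse_rank_one`, the iteration count, and the way the factors are formed from $w$ are assumptions of this example, and $k \ge 1$ is assumed.

```python
import numpy as np

def truncate(v, k):
    """Keep the k largest-magnitude entries of v and renormalize to unit length."""
    out = np.zeros_like(v)
    keep = np.argsort(-np.abs(v))[:k]
    out[keep] = v[keep]
    return out / np.linalg.norm(out)

def sparse_rank_one(E, k, n_iter=50):
    """Heuristic for max w^T E^T E w s.t. ||w||_2 = 1, ||w||_0 <= k,
    returning a unit-norm atom d and a k-sparse row vector alpha with E ~ d alpha^T."""
    # Start from the leading right singular vector, truncated to k entries.
    _, _, Vt = np.linalg.svd(E, full_matrices=False)
    w = truncate(Vt[0], k)
    for _ in range(n_iter):
        v = E.T @ (E @ w)                 # power step on E^T E
        if np.linalg.norm(v) == 0:
            break
        w = truncate(v, k)                # project back onto the sparse unit sphere
    Ew = E @ w
    scale = np.linalg.norm(Ew)
    d = Ew / scale if scale > 0 else Ew
    alpha = scale * w
    return d, alpha

# Toy usage: a residual that is exactly rank one with a 3-sparse row factor.
rng = np.random.default_rng(0)
d0 = rng.standard_normal(6)
a0 = np.zeros(8)
a0[[1, 4, 6]] = [2.0, -1.0, 0.5]
E = np.outer(d0, a0)
d, alpha = sparse_rank_one(E, k=3)
```

Because `alpha` never has more than `k` non-zero entries, running this update for each row with budget $k_i^r$ leaves the total sparsity of $A$ unchanged while allowing the non-zeros to move between samples.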

The remaining issues are how to appropriately specify the initial dictionary $D$ and the coefficient
matrix $A$ in step 1, and when to terminate the iterative process in step 2 of the proposed algorithm. In
our experiments, $D$ was simply initialized with data signals, and $A$ with a sparse matrix containing $K$ non-zero
elements whose positions were randomly generated in the matrix. As for the termination of
the algorithm, since the entire representation error of the signals, i.e., the objective function of $(P_K)$, decreases
monotonically under the fixed global sparsity constraint throughout the iterative process (see details in
the next section), the algorithm can reasonably be terminated when the decrease in the value of the objective
function is smaller than some preset small threshold, or when the process has reached the pre-specified number
of iterations.
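
The initialization and the stopping rule described above can be sketched as follows. The normalization of the initial atoms, the Gaussian values of the initial non-zero coefficients, and the names `initialize` and `should_stop` with their default thresholds are assumptions made for this illustration; the text only specifies data signals for $D$ and $K$ randomly positioned non-zeros for $A$.

```python
import numpy as np

def initialize(X, m, K, rng):
    """D: m randomly chosen data samples (normalized here); A: K non-zeros at random positions."""
    d, n = X.shape
    cols = rng.choice(n, size=m, replace=False)
    D = X[:, cols].astype(float)
    D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)    # unit-norm atoms (assumption)
    A = np.zeros((m, n))
    positions = rng.choice(m * n, size=K, replace=False)  # K distinct entries of A
    A.flat[positions] = rng.standard_normal(K)            # random values (assumption)
    return D, A

def should_stop(prev_obj, obj, it, tol=1e-6, max_iter=100):
    """Stop when the objective decrease falls below tol or max_iter is reached."""
    return (prev_obj - obj) < tol or it >= max_iter

# Toy usage with random data of the same shape as the synthetic signals.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 100))
D0, A0 = initialize(X, m=10, K=50, rng=rng)
```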

Now we investigate the convergence and the computational complexity of the proposed algorithm in 
the next section. 

C. Convergence and computational complexity 

We first discuss the convergence of the proposed algorithm, i.e., the iterations between the column
and row updating steps of the algorithm. Under the assumption that models (2) and (4) can be precisely
solved, each of the updating iterations in step 2 monotonically decreases the total representation error
$\|DA - X\|_F^2$, while the constraint $\|A\|_0 \le K$ is consistently held. Therefore, a local
minimum of $(P_K)$ is expected to be attained by the proposed algorithm. Although the above claim depends
on the success of the sparse coding and sparse PCA techniques employed in approximating the solutions
of (2) and (4), respectively, the algorithms employed for these tasks performed well in our experiments and
could empirically generate a reasonable solution of $(P_K)$ after multiple iterations, as demonstrated in Section
IV.

The computational complexity of the proposed algorithm is essentially determined by the iterative
process between the column and row updating steps. By employing recent sparse coding and sparse
PCA techniques, e.g., the OMP [27] and sPCA-rSVD [32] algorithms, respectively, in our experiments, both
steps can be efficiently performed, requiring around $O(dnmk) \times \mathrm{Iter}$ computational cost, where $k$ is the
maximal value of the $k_i$'s, and Iter is the iteration number of the algorithm^1. That is, the computational time
of the proposed algorithm increases linearly with the dimensionality and the size of the input samples, as
well as the number of atoms in the dictionary. The computational complexity of the proposed algorithm
is comparable to that of the current dictionary learning algorithms [9], [10], [20].

IV. Experimental results 

To test the effectiveness of the proposed algorithm on dictionary learning, it was applied to a series of 
synthetic signals and real images for substantiation. The results are summarized in the following discussion. 
All programs were implemented on the Matlab 7.0 platform. 

A. Synthetic signal experiments 

We first apply the proposed algorithm to synthetic signal data to test whether the algorithm can recover 
the generating dictionary and reconstruct the original signals. 

Two series of signal experiments were implemented, each involving 11 sets of signals. Each signal set
contained 1500 20-dimensional signals, denoted as $X = [x_1, x_2, \ldots, x_{1500}] \in \mathbb{R}^{20 \times 1500}$, which were created
by a linear combination of a pre-specified dictionary $D = [d_1, d_2, \ldots, d_{50}] \in \mathbb{R}^{20 \times 50}$ and representation
coefficients $A = [\alpha_1, \alpha_2, \ldots, \alpha_{1500}] \in \mathbb{R}^{50 \times 1500}$, mixed with different extents of homogeneous
Gaussian white noise. The entries of each dictionary $D$ were first generated by random sampling, and
each column (atom) $d_i$ of $D$ was then divided by its $\ell_2$-norm for normalization. For the first series of
experiments, each column $\alpha_i$ of $A$ contained 3 non-zero elements with randomly chosen values and
locations. For the second series of experiments, each coefficient matrix $A$ consisted of 4500 randomly
valued and located non-zero entries. Added to the 11 signal sets in both series of experiments were
homogeneous Gaussian noises with standard deviations varying from 0 to 0.1 with interval 0.01. The
SNR values of these signal sets ranged from infinity to around 10, correspondingly^2. It should be indicated
that for the first experimental series, the signals in each experiment were of similar representation sparsity
over the preset dictionary, which complies with the assumption of the model $(P_k)$; and for both series of
experiments, the signals were corrupted with homogeneous noises, which tallies with the assumption
underlying the model $(P_\varepsilon)$.

^1 In each iteration of the proposed algorithm, the optimization problem (2) needs to be solved for $i = 1, \ldots, n$, each requiring $O(dmk_i)$
cost from utilizing the OMP algorithm [9], [27], and problem (3) needs to be solved for $i = 1, \ldots, m$, each costing $O(dn)$ computation from
employing the sPCA-rSVD algorithm [32]. Thus, the total computational complexity of the proposed algorithm is around $O(dnmk) \times \mathrm{Iter}$.
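
The construction of the two synthetic series can be sketched as follows. The text only states that entries, values, and locations were randomly sampled; the use of standard normal draws, the helper names, and the single noise level `sigma = 0.05` shown here are assumptions for illustration.

```python
import numpy as np

def make_dictionary(d, m, rng):
    """Random dictionary with each atom divided by its l2-norm."""
    D = rng.standard_normal((d, m))
    return D / np.linalg.norm(D, axis=0)

def coefficients_fixed_per_column(m, n, k, rng):
    """First series: every column has exactly k non-zeros at random locations."""
    A = np.zeros((m, n))
    for i in range(n):
        rows = rng.choice(m, size=k, replace=False)
        A[rows, i] = rng.standard_normal(k)
    return A

def coefficients_global(m, n, K, rng):
    """Second series: K non-zeros scattered over the whole matrix."""
    A = np.zeros((m, n))
    positions = rng.choice(m * n, size=K, replace=False)
    A.flat[positions] = rng.standard_normal(K)
    return A

rng = np.random.default_rng(0)
D = make_dictionary(20, 50, rng)
A1 = coefficients_fixed_per_column(50, 1500, 3, rng)   # 3 non-zeros per signal
A2 = coefficients_global(50, 1500, 4500, rng)          # 4500 non-zeros in total

sigma = 0.05                                            # one of the levels 0, 0.01, ..., 0.1
X1 = D @ A1 + sigma * rng.standard_normal((20, 1500))
X2 = D @ A2 + sigma * rng.standard_normal((20, 1500))
```

Since 3 × 1500 = 4500, both series contain the same total number of non-zero coefficients; they differ only in whether that budget is spread uniformly over the signals.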

Four of the current dictionary learning methods, including K-SVD$_{P_k}$ [4], K-SVD$_{P_\varepsilon}$ [9], MOD [21], and
BPFA [10], were applied to these signal sets for comparison. The dictionaries of the first three methods,
as well as that of the proposed method, were initialized as randomly selected signals from the input set,
and the initialization of the last method was based on the singular value decomposition technique [10].
Since both MOD and K-SVD$_{P_\varepsilon}$ need the a priori deviation of the noise mixed in the signals to preset
the representation error parameter, we directly used the ground truth information to optimally specify the
parameter value [9]. For K-SVD$_{P_k}$, the sparsity constraint parameter $k$ was set as the real sparsity in the
first experimental series. For the second series, it was specified by running the method 5 times on each
signal set under different values of $k$, and selecting the best one as the final output. All parameters involved in
the BPFA were automatically inferred by using a full posterior on the model [10]. For the proposed GDL
method, the global sparsity $K$ was set as 4500 for all experiments. The results of all involved methods
were attained after 100 iterations.

Two criteria are utilized to assess the performance of the employed methods for dictionary learning.
The first is computed by sweeping through each atom of the generating dictionary and checking whether
it is recovered by the dictionary attained by a utilized method via the following formula [4]:

$$1 - |d_i^T \tilde{d}_i|, \qquad (6)$$

where $d_i$ is the atom of the original dictionary and $\tilde{d}_i$ is its corresponding closest element in the recovered
dictionary. If the value of (6) was less than 0.01, then the atom was considered successfully recovered. The rate of the
successfully recovered atoms in the generating dictionary (called the dictionary recovery rate, or DR for
short) is then taken as the first criterion, which evaluates the capability of the method in delivering the
original dictionary beneath the input signals. The second criterion is the mean of the standard deviations
of the reconstructed signals from the original signals (called the representation error, or RE for short). This
value assesses the performance of the method in recovering the input signals.

^2 The signal set mixed with Gaussian noise with deviation 0 is clean and contains no noise. The corresponding value
of SNR is infinity.
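
The two criteria can be computed along the following lines. This is a sketch under stated assumptions: atoms are renormalized before comparison, and RE is read as the mean over signals of the root-mean-square deviation between reconstructed and original signals, which is one interpretation of the verbal definition above. The function names are hypothetical.

```python
import numpy as np

def dictionary_recovery_rate(D_true, D_learned, tol=0.01):
    """Fraction of true atoms d_i whose closest learned atom satisfies 1 - |d_i^T d~_i| < tol."""
    Dt = D_true / np.linalg.norm(D_true, axis=0)
    Dl = D_learned / np.linalg.norm(D_learned, axis=0)
    closeness = np.max(np.abs(Dt.T @ Dl), axis=1)    # best |inner product| for each true atom
    return float(np.mean(1.0 - closeness < tol))

def representation_error(X_clean, X_rec):
    """Mean over signals of the RMS deviation from the original signal (assumed reading of RE)."""
    return float(np.mean(np.sqrt(np.mean((X_clean - X_rec) ** 2, axis=0))))

# Toy usage: three atoms recovered up to order and sign, one atom missed.
I = np.eye(4)
D_learned = np.stack([-I[:, 1], I[:, 0], I[:, 2], (I[:, 2] + I[:, 3]) / np.sqrt(2)], axis=1)
dr = dictionary_recovery_rate(I, D_learned)
re = representation_error(np.zeros((3, 2)), np.full((3, 2), 0.5))
```

The absolute value in (6) makes the criterion insensitive to the sign of a recovered atom, and taking the maximum over learned atoms makes it insensitive to their order.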

In all of the implemented experiments, the DR and RE values in the iterative processes of the four
current methods and the proposed method were recorded. The upper and lower panels of Fig. 2 depict the
RE and DR curves of the five methods in the iterative processes of three of the first series of experiments,
respectively. Fig. 3 shows the corresponding results in three cases of the second series of experiments.
Panels (a) and (c) of Fig. 4 show the final RE and DR values of the five methods in 11 experiments
of the first series, respectively. For easy comparison, panels (b) and (d) of the figure display the mean
values of RE and DR of the five methods as vertical bars. Fig. 5 depicts the cases of the second series
of experiments.

It can easily be observed from the upper panels of Figs. 2 and 3 that the RE values obtained by the
proposed method tend to decrease monotonically throughout the iterative process. Besides, after around
10 iterations, the GDL algorithm attains the smallest (5 out of 6 depicted cases) or the second smallest
RE values among the five methods. In the final output, the GDL algorithm also achieves the smallest (7
out of 11 experiments in the first series and 8 out of 11 experiments in the second series) or the second
smallest RE values in all of the experiments, as shown in Fig. 4(a) and Fig. 5(a). On average,
the proposed algorithm outperforms the other four methods in both series of experiments, as shown in Fig.
4(b) and Fig. 5(b). This demonstrates the capability of the proposed algorithm in reconstructing
the input signals.

Furthermore, from the lower panels of Figs. 2 and 3, it can be observed that the DR curves of the
proposed method tend to increase monotonically in all experiments, and the method obtains the largest (4
out of 6 depicted cases) or the second largest values among the five methods after 10 iterations. Moreover,
as seen from Fig. 4(c) and Fig. 5(c), the proposed algorithm yields the most stable DR values
among the five methods, and successfully detects more than 90% of the atoms of the original dictionary in each
of the experiments. Specifically, the algorithm attains the largest (6 out of 11 experiments in the first series
and 9 out of 11 experiments in the second series) or second largest DR values in all experiments, and
achieves the largest average DR values in both experimental series, as shown in Fig. 4(d) and Fig. 5(d).
This substantiates the good capability of the GDL algorithm in recovering the original dictionary.

Fig. 2. The upper panels: the RE curves of the K-SVD$_{P_k}$, K-SVD$_{P_\varepsilon}$, MOD, BPFA, and the proposed GDL methods in the iterative
processes in three of the first series of experiments. The lower panels: the corresponding DR curves. It should be noted that the BPFA fails
to take effect on the signals in the sense that the method always obtains RE values larger than 0.2 and zero DR values.

Fig. 3. The upper panels: the RE curves of the five methods in the iterative processes in three of the second series of experiments. The
lower panels: the corresponding DR curves. It should be noted that BPFA fails to take effect on the signals in the sense that the method
always obtains RE values larger than 0.3 and zero DR values.

Fig. 4. (a)(c): the final RE and DR values obtained by the five methods in 11 experiments of the first series. (b)(d): the mean RE and
DR values obtained by the five methods in the first series of experiments. The numbers 1-5 on the horizontal axis stand for the K-SVD$_{P_k}$,
K-SVD$_{P_\varepsilon}$, MOD, BPFA, and GDL methods, respectively. The average RE value of the BPFA experiments is larger than 0.2 and hence
cannot be observed in (a). The gray bar in (b) means that the BPFA attains zero DR value and thus cannot take effect on the signals.

Fig. 5. (a)(c): the final RE and DR values obtained by the five methods in 11 experiments of the second series. (b)(d): the mean RE and
DR values obtained by the five methods in the second experimental series. The numbers 1-5 on the horizontal axis stand for the K-SVD$_{P_k}$,
K-SVD$_{P_\varepsilon}$, MOD, BPFA, and GDL methods, respectively. The average RE value of the BPFA experiments is larger than 0.25 and hence
cannot be observed in (a). The gray bar in (b) means that the BPFA attains zero DR value and thus cannot take effect on the signals.

In the next section, we further verify the effectiveness of the proposed method on image reconstruction
from data with more complex intrinsic structures and more complicated nonhomogeneous noises.

B. Real image experiments 

A series of test images of 512 × 512 or 256 × 256 pixels were utilized for the image reconstruction 
problems. These images were generated by combining 6 gray-scale images, all of which are widely used 
in the image processing literature [33], with different levels of nonhomogeneous noise. In our experiments, 
four types of nonhomogeneous noise were employed: (1) nonhomogeneous Gaussian noise with extent δ: 
the standard deviation of the Gaussian noise increases uniformly across the image, from the lower-right 
pixels up to δ at the upper-left pixels; (2) salt-pepper noise with extent p: p percent of the pixels are 
corrupted into dead pixels with either maximum or minimum intensity values; (3) mixture of homogeneous 
Gaussian and salt-pepper noise, with extents σ and p, respectively; (4) mixture of nonhomogeneous 
Gaussian and salt-pepper noise, with extents δ and p, respectively. 
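
The noise types listed above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the generator used in the experiments: the function names are hypothetical, the ramp of the nonhomogeneous Gaussian noise is assumed to be linear in (row + column) and to start from 0 at the lower-right corner, and the intensity range is assumed to be [0, 255].

```python
import numpy as np

def homogeneous_gaussian(img, sigma, rng):
    """Add zero-mean Gaussian noise with the same std (sigma) at every pixel."""
    return img + sigma * rng.standard_normal(img.shape)

def nonhomogeneous_gaussian(img, delta, rng):
    """Add Gaussian noise whose std ramps across the image.

    Assumption: the std is 0 at the lower-right corner and delta at the
    upper-left corner, varying linearly with (row + col).
    """
    h, w = img.shape
    rows, cols = np.mgrid[0:h, 0:w]
    # 1 at the upper-left pixel (0, 0), 0 at the lower-right pixel (h-1, w-1)
    ramp = 1.0 - (rows + cols) / float(h + w - 2)
    return img + rng.standard_normal(img.shape) * (delta * ramp)

def salt_pepper(img, p, rng, lo=0.0, hi=255.0):
    """Set p percent of the pixels to lo or hi, each chosen at random."""
    out = np.asarray(img, dtype=float).copy()
    n_dead = int(round(out.size * p / 100.0))
    idx = rng.choice(out.size, size=n_dead, replace=False)
    out.flat[idx] = rng.choice([lo, hi], size=n_dead)
    return out
```

Under these assumptions, the two mixture cases would be obtained by composing the functions, e.g. `salt_pepper(nonhomogeneous_gaussian(img, delta, rng), p, rng)`; the order of composition is itself an assumption.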

The dictionary was trained on the overlapping patches of 8 × 8 pixels (i.e., the input samples have 
dimension d = 64) from a noisy image in each experiment, and thus each experiment included 
n = (256 − 7)² = 62,001 patches (all available patches from the 256 × 256 images, and every second 
patch from every second row in the 512 × 512 images). In each of the experiments, the dictionary D 
contained m = 256 atoms. For each dictionary learning method, the images were rebuilt by averaging the 
overlapping patches reconstructed over the dictionary attained by the method. 
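
As a minimal sketch of the patch handling described above, the following NumPy code extracts overlapping patches as columns of a data matrix and rebuilds an image by averaging overlapping patches. The function names and the `step` parameter (for subsampling patch positions) are hypothetical and not taken from the paper.

```python
import numpy as np

def extract_patches(img, size=8, step=1):
    """Return a (size*size, n) matrix whose columns are vectorized patches."""
    h, w = img.shape
    cols = [img[r:r + size, c:c + size].reshape(-1)
            for r in range(0, h - size + 1, step)
            for c in range(0, w - size + 1, step)]
    return np.stack(cols, axis=1)

def average_patches(patches, shape, size=8, step=1):
    """Rebuild an image by averaging overlapping (reconstructed) patches."""
    h, w = shape
    acc = np.zeros(shape, dtype=float)   # sum of patch values per pixel
    cnt = np.zeros(shape, dtype=float)   # number of patches covering each pixel
    k = 0
    for r in range(0, h - size + 1, step):
        for c in range(0, w - size + 1, step):
            acc[r:r + size, c:c + size] += patches[:, k].reshape(size, size)
            cnt[r:r + size, c:c + size] += 1.0
            k += 1
    # Pixels covered by no patch (possible when step > 1) are left at 0.
    return acc / np.maximum(cnt, 1.0)
```

With `size=8` and `step=1`, a 256 × 256 image yields (256 − 7)² = 62,001 columns of dimension 64. Passing the unmodified patches back to `average_patches` returns the input image, which is a quick sanity check before substituting patches reconstructed over a learned dictionary.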

Three of the current methods, DCT [7], K-SVDp^ [9], and BPFA [11],¹ were also applied to these 
images for comparison. The dictionary utilized by the DCT method is the overcomplete DCT bases, 
while the dictionaries of K-SVDp^, BPFA, and GDL were trained from the images. Randomly selected 
image patches were used as the initialization of the K-SVDp^ and GDL methods, and the 
singular-value-decomposition-based initialization was used for BPFA. The K-SVDp^, BPFA, and GDL results were 
all obtained with 10 iterations. 

¹ The results of the K-SVDp^ and MOD methods are omitted since their performance on the images is comparatively lower 
than that of the other adopted methods. 

Since both DCT and K-SVDp^ need to preset the parameter that evaluates 
the mean noise deviation over all image pixels, we ran both methods 10 times on each 
test image under different initializations of this parameter and recorded only the best result as the final 
one. In comparison, the proposed GDL algorithm was run only once, with K = 15000, for all 
experiments. The results are quantitatively measured by the PSNR value (in dB) defined by 

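$$\mathrm{PSNR}(\hat{f}) = 10 \log_{10} \frac{255^2}{\frac{1}{N}\|f - \hat{f}\|_2^2},$$

where $f$ and $\hat{f}$ denote the original and the reconstructed images, respectively, and $N$ is the number of pixels. Besides, by looking through 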
the numbers of atoms required for representing the image patches (i.e., the numbers of non-zero 
elements in the corresponding representation coefficients) and averaging the results over the entire image, 
an atom-using-frequency figure, of the same resolution as the original image, can be obtained. In such a 
figure, the brighter a pixel is, the more atoms are assigned to 
represent the image patches containing that pixel, and thus the more emphasis the corresponding method 
places on the region around it, and vice versa. Therefore, the atom-using-frequency 
figure qualitatively reflects the intrinsic image structure explored by the method. 
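
The two evaluation tools described in this paragraph can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the peak value 255 assumes 8-bit gray-scale images, the function names are hypothetical, and the columns of the coefficient matrix are assumed to follow a row-by-row raster scan of patch positions.

```python
import numpy as np

def psnr(f, f_hat, peak=255.0):
    """PSNR in dB between an original image f and a reconstruction f_hat."""
    diff = np.asarray(f, dtype=float) - np.asarray(f_hat, dtype=float)
    mse = np.mean(diff ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def atom_usage_map(codes, shape, size=8, step=1):
    """Average, per pixel, the number of atoms used by the patches covering it.

    codes: (m, n) coefficient matrix with one column per patch; patch k is
    assumed to be the k-th position of a raster scan with the given step.
    """
    h, w = shape
    used = np.count_nonzero(codes, axis=0)   # number of atoms per patch
    acc = np.zeros(shape, dtype=float)
    cnt = np.zeros(shape, dtype=float)
    k = 0
    for r in range(0, h - size + 1, step):
        for c in range(0, w - size + 1, step):
            acc[r:r + size, c:c + size] += used[k]
            cnt[r:r + size, c:c + size] += 1.0
            k += 1
    # Pixels covered by no patch (possible when step > 1) are left at 0.
    return acc / np.maximum(cnt, 1.0)
```

Displaying the returned array as a gray-scale image gives a figure of the same resolution as the input, in which brighter pixels correspond to regions whose patches use more atoms.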

Table I lists details of the PSNR results of the reconstructed images. Figs. 6-9 depict the recovered 
images, along with their PSNR values, obtained by applying the DCT, K-SVD, BPFA, and GDL methods to 
four typical test images mixed with different types of noise, respectively. The corresponding dictionaries 
and atom-using-frequency figures attained by these methods are also displayed in these figures for easy 
evaluation. 

TABLE I 

Summary of the PSNR results of the image experiments. In each cell, the upper-left, upper-right, lower-left, and 
lower-right values refer to the DCT, K-SVD, BPFA, and GDL methods, respectively. The best result in each cell is 
highlighted. The last row gives the average results of the four methods over all noise cases for each image. 

Fig. 6. Results on the Barbara image mixed with nonhomogeneous Gaussian noise with extent 61.32. The panels from top to bottom: 
the reconstructed images, the dictionaries, and the atom-using-frequency figures obtained by the DCT, K-SVD, BPFA, and GDL methods, 
respectively. 

Fig. 7. Results on the Peppers image mixed with salt-pepper noise with extent p = 10. The panels from top to bottom: the reconstructed 
images, the dictionaries, and the atom-using-frequency figures obtained by the DCT, K-SVD, BPFA, and GDL methods, respectively. 

Fig. 8. Results on the Boat image corrupted by a mixture of homogeneous Gaussian noise and salt-pepper noise with extents σ = 20 and 
p = 5. The panels from top to bottom: the reconstructed images, the dictionaries, and the atom-using-frequency figures obtained by the 
DCT, K-SVD, BPFA, and GDL methods, respectively. 

Fig. 9. Results on the House image corrupted by a mixture of nonhomogeneous Gaussian noise and salt-pepper noise with extents δ = 51.10 
and p = 4. The panels from top to bottom: the reconstructed images, the dictionaries, and the atom-using-frequency figures obtained by the 
DCT, K-SVD, BPFA, and GDL methods, respectively. 

The advantages of the proposed method can easily be observed in these results. First, our algorithm 
best rebuilds the original images among all employed methods. Specifically, as compared with the DCT, 
K-SVD and BPFA methods, our method achieves the largest PSNR values in almost all of the experiments, 
and the average PSNR values obtained by our method are the largest among those of the employed methods 
for each image. This advantage can also be seen in the first rows of Figs. 6-9. Our method most 
prominently removes the noise from the noisy images, e.g., the bookshelf of the Barbara image in Fig. 6, 
and recovers most of the details of the original image, e.g., the windows of the House image 
in Fig. 9 (the details can be seen better by zooming in on the images on a computer). These results 
show the capability of the proposed algorithm in reconstructing the original images. 

Second, the proposed method more robustly attains the proper dictionary underlying the images as 
compared with the other dictionary learning methods. This can easily be observed in the second rows 
of Figs. 6-9. The dictionaries attained by our method capture more meaningful features 
underlying the images and are least affected by the noise in all cases. These results validate the capability 
of the proposed algorithm in properly generating the dictionary for images corrupted by nonhomogeneous 
noise. 

Third, the atom-using-frequency figures obtained by our method faithfully reflect the intrinsic structures 
underlying the images. In particular, from the third rows of Figs. 6-9, it can be observed that the 
atom-using-frequency figures obtained by the proposed method clearly depict the basic edge information underlying 
the images. This is because the patches around edges have relatively complicated structures, 
and our method adaptively assigns more atoms to represent these image patches. In comparison, 
such meaningful structures are not so noticeably detected by the atom-using-frequency figures of the 
other methods in the experiments. These results demonstrate the capability of the GDL method in 
detecting major/meaningful structure information while at the same time eliminating the minor/useless 
noise information underlying the images at the global scale. 

V. Conclusion 

In this paper we have proposed a novel dictionary learning method. Instead of enforcing a similar sparsity 
constraint on each input sample like the current methods, the new method imposes a global sparsity constraint 
on all training data samples, which makes it capable of optimally assigning atoms for 
representing the various samples and adapting to the intrinsic data structures at the global scale. An 
efficient algorithm has also been developed, which is easy to implement based on 
the sparse coding and sparse PCA techniques, and is expected to converge to a locally optimal solution 
of the problem. Based on the experimental results on a series of signal and image data sets, it has been 
substantiated that, as compared with the current dictionary learning methods, the proposed method can 
more faithfully deliver the original dictionary and properly reconstruct the input samples. Besides, it has 
been theoretically analyzed and empirically verified that by utilizing the proposed method, the atoms of 
the dictionary can be optimally adapted to represent samples with various intrinsic complexities, and the 
frequency of atom-using can facilitate revealing the intrinsic structure underlying the input samples. 

Multiple problems still need to be investigated in our future research. For example, the effectiveness 
of the proposed algorithm needs to be further tested on real signals and images with complicated 
noise types, e.g., Poisson noise. Besides, qualitatively speaking, the more complex the overall structure 
of the input data, or the less noise it contains, the larger the global sparsity parameter K should be 
set. Further investigation, however, is still needed to design an automatic quantitative parameter 
selection strategy to improve the quality of the proposed method. Furthermore, research is needed 
to improve the efficiency of the proposed algorithm by virtue of the online [22] or convexification 
[12] techniques. 



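Appendix: Proof of Theorem 1 

Theorem 1: The optimum of the model 

$$\min_{d_i, \alpha^i} \ \left\| E_i - d_i (\alpha^i)^T \right\|_F^2 \quad \text{s.t.} \quad d_i^T d_i = 1, \ \|\alpha^i\|_0 \le k_i \tag{3}$$

can be attained by solving the optimization problem 

$$\max_{w} \ w^T E_i^T E_i w \quad \text{s.t.} \quad w^T w = 1, \ \|w\|_0 \le k_i \tag{4}$$

in the sense of 

$$\hat{d}_i = \frac{E_i \hat{w}}{\|E_i \hat{w}\|_2}, \qquad \hat{\alpha}^i = \|E_i \hat{w}\|_2 \, \hat{w}, \tag{5}$$

where $\hat{d}_i$, $\hat{\alpha}^i$ are the optima of (3), and $\hat{w}$ is the optimum of (4), respectively. 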
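Proof: 

(i) For the optimal solutions $\hat{d}_i$, $\hat{\alpha}^i$ of (3), it holds that 

$$\hat{d}_i = \frac{E_i \hat{\alpha}^i}{\|E_i \hat{\alpha}^i\|_2}.$$

Proof of (i): The objective of (3) can be equivalently transformed as: 

$$\begin{aligned}
\left\| E_i - d_i (\alpha^i)^T \right\|_F^2 &= \mathrm{tr}\left( \left(E_i - d_i (\alpha^i)^T\right)\left(E_i - d_i (\alpha^i)^T\right)^T \right) \\
&= \mathrm{tr}\left( E_i E_i^T - E_i \alpha^i d_i^T - d_i (\alpha^i)^T E_i^T + d_i (\alpha^i)^T \alpha^i d_i^T \right) \\
&= \mathrm{tr}(E_i E_i^T) + (\alpha^i)^T \alpha^i - 2 (E_i \alpha^i)^T d_i,
\end{aligned}$$

where the last equality uses $d_i^T d_i = 1$. 

By fixing $\alpha^i$ as the optimal solution $\hat{\alpha}^i$ of (3), it is easy to deduce that the optimal $d_i$ of (3) can be 
obtained by solving the reformulated optimization problem: 

$$\max_{d_i} \ (E_i \hat{\alpha}^i)^T d_i \quad \text{s.t.} \quad d_i^T d_i = 1. \tag{7}$$

That is, the optimal $d_i$ of (7) should be the unit vector with maximal inner product with the vector $E_i \hat{\alpha}^i$. 
It then follows that $d_i$ should be parallel to the direction of $E_i \hat{\alpha}^i$, i.e., 

$$\hat{d}_i = \frac{E_i \hat{\alpha}^i}{\|E_i \hat{\alpha}^i\|_2}.$$

The proof of (i) is then completed. 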
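(ii) It holds that 

$$\|\hat{\alpha}^i\|_2 = \left\| E_i \frac{\hat{\alpha}^i}{\|\hat{\alpha}^i\|_2} \right\|_2.$$

Proof of (ii): Denote $w = \hat{\alpha}^i / \|\hat{\alpha}^i\|_2$ and $\hat{p} = \|\hat{\alpha}^i\|_2$. It is then easy to deduce that $\hat{p}$ is the optimum of the 
following optimization problem: 

$$\min_{p} \ \left\| E_i - p \, \hat{d}_i w^T \right\|_F^2,$$

since for any value of $p$, 

$$\left\| E_i - \hat{p} \, \hat{d}_i w^T \right\|_F^2 = \left\| E_i - \hat{d}_i (\hat{\alpha}^i)^T \right\|_F^2 \le \left\| E_i - \hat{d}_i (p w)^T \right\|_F^2,$$

where $p w$ is also a vector with at most $k_i$ non-zero entries. Since 

$$\left\| E_i - p \, \hat{d}_i w^T \right\|_F^2 = \mathrm{tr}(E_i E_i^T) + p^2 - 2 p \, (E_i w)^T \hat{d}_i,$$

based on (i), it is easy to deduce that 

$$\|\hat{\alpha}^i\|_2 = \hat{p} = (E_i w)^T \hat{d}_i = \frac{(E_i w)^T E_i w}{\|E_i w\|_2} = \|E_i w\|_2.$$

The proof of (ii) is completed. 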

(iii) The model (3) can be solved through the optimization problem (4) in the sense of the transformations (5). 

Proof of (iii): Denote the vector $w$ as the normalized $\hat{\alpha}^i$, i.e., 

$$w = \frac{\hat{\alpha}^i}{\|\hat{\alpha}^i\|_2}.$$

Based on (i) and (ii), we have 

$$\hat{d}_i = \frac{E_i \hat{\alpha}^i}{\|E_i \hat{\alpha}^i\|_2} = \frac{E_i w}{\|E_i w\|_2} \tag{8}$$

and 

$$\hat{\alpha}^i = \|\hat{\alpha}^i\|_2 \, w = \|E_i w\|_2 \, w. \tag{9}$$

The minimum of the objective function of (3) can then be equivalently transformed as 

$$\left\| E_i - \hat{d}_i (\hat{\alpha}^i)^T \right\|_F^2 = \left\| E_i - E_i w w^T \right\|_F^2. \tag{10}$$

We further transform (10) as follows: 

$$\begin{aligned}
\left\| E_i - E_i w w^T \right\|_F^2 &= \mathrm{tr}\left( \left(E_i - E_i w w^T\right)\left(E_i - E_i w w^T\right)^T \right) \\
&= \mathrm{tr}(E_i E_i^T) - \mathrm{tr}(E_i w w^T E_i^T) \\
&= \|E_i\|_F^2 - w^T E_i^T E_i w,
\end{aligned}$$

where the second equality uses $w^T w = 1$. 

Thus, the optimum of (3) can be attained through the optimal solution of the following optimization 
problem 

$$\min_{w} \ \|E_i\|_F^2 - w^T E_i^T E_i w \quad \text{s.t.} \quad w^T w = 1, \ \|w\|_0 \le k_i$$

in the sense of the transformations (8) and (9). Since $\|E_i\|_F^2$ is a fixed constant for optimization, we can 
equivalently reformulate the above optimization problem as (4). 
This thus completes the proof of the theorem. 

■ 
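
The statement of Theorem 1 can be checked numerically on a small random instance. The sketch below is only an illustration, not part of the proof or of the proposed algorithm: it solves the sparse-PCA side by brute force over all supports of size k, and solves the rank-one model independently with a per-support SVD (the Eckart-Young argument, which is a different route from the one taken in the proof). All function names are hypothetical, and `E` and `k` stand for the residual matrix E_i and the sparsity level k_i.

```python
import itertools
import numpy as np

def best_sparse_direction(E, k):
    """Brute-force max of w^T E^T E w s.t. ||w||_2 = 1, ||w||_0 <= k."""
    n = E.shape[1]
    G = E.T @ E
    best_val, best_w = -np.inf, None
    for support in itertools.combinations(range(n), k):
        idx = list(support)
        # On a fixed support the problem is an ordinary eigenvalue problem.
        vals, vecs = np.linalg.eigh(G[np.ix_(idx, idx)])
        if vals[-1] > best_val:
            best_val = vals[-1]
            best_w = np.zeros(n)
            best_w[idx] = vecs[:, -1]
    return best_w, best_val

def best_rank_one_sparse(E, k):
    """Brute-force min of ||E - d a^T||_F^2 s.t. ||d||_2 = 1, ||a||_0 <= k."""
    n = E.shape[1]
    total = np.sum(E ** 2)
    best = np.inf
    for support in itertools.combinations(range(n), k):
        s = np.linalg.svd(E[:, list(support)], compute_uv=False)
        # The best rank-one fit on this support explains s[0]**2 of the energy.
        best = min(best, total - s[0] ** 2)
    return best

def theorem_residual(E, k):
    """Residual of the rank-one model when (d, a) are built from w."""
    w, val = best_sparse_direction(E, k)
    Ew = E @ w
    d = Ew / np.linalg.norm(Ew)          # d = E w / ||E w||_2
    a = np.linalg.norm(Ew) * w           # a = ||E w||_2 w
    return np.sum((E - np.outer(d, a)) ** 2), val
```

For a random `E`, the residual returned by `theorem_residual` should match both `best_rank_one_sparse(E, k)` and `np.sum(E**2) - val`, up to floating-point error. The enumeration over supports is exponential in the number of columns, so this check is only practical for very small matrices.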